Visualization of Large Complex Data
Steve Elston
12/26/2020
Visualizing Large Complex Data is Difficult
Problem: Modern data sets are growing in size and complexity
Goal: Understand key relationships in large complex data sets
Difficulty: Large data volume
- Modern computational systems have massive capacity
- Example: Use map-reduce algorithms on cloud clusters
Difficulty: Large numbers of variables
- Huge number of variables with many potential relationships
- This is the hard part!
Limitation of Scientific Graphics
All scientific graphics are limited to a 2-dimensional projection
But, complex data sets have a great many dimensions
We need methods to project large complex data onto 2 dimensions
Generally, multiple views are required to understand complex data sets
- Don’t expect one view to show all important relationships
- Develop understanding over many views
- Try many views, don’t expect most to be very useful
Scalable Chart Types
Some chart types are inherently scalable.
- Bar plots: Counts can be computed; e.g. use map-reduce
- Histograms: Data is binned in parallel
- Box plots: Finding the quartiles is a scalable counting process
- KDE and violin plots: Scalable in much the same way as the box plot, using kernel density estimation
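The idea that histogram binning parallelizes can be sketched as follows. This is a minimal illustration, not a map-reduce implementation: the chunks are processed in a plain loop, and the function name `chunked_histogram`, the bin edges, and the synthetic data are all assumptions made for the example. The key point is that every chunk uses the same fixed bin edges, so per-chunk counts can simply be summed.

```python
import numpy as np

def chunked_histogram(chunks, bin_edges):
    """Sum per-chunk bin counts (the 'reduce' step) into one histogram."""
    total = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for chunk in chunks:
        # 'Map' step: each chunk is binned independently of the others,
        # so in a real system this work could run on separate workers.
        counts, _ = np.histogram(chunk, bins=bin_edges)
        total += counts
    return total

# Synthetic data standing in for a large data set (an assumption).
rng = np.random.default_rng(42)
data = rng.normal(size=100_000)

bin_edges = np.linspace(-5.0, 5.0, 41)   # fixed edges shared by every chunk
chunks = np.array_split(data, 8)         # pretend each chunk lives on one worker
counts = chunked_histogram(chunks, bin_edges)
```

The merged counts match what a single pass over all the data would produce, which is what makes the chart type scalable.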
Over-plotting
Over-plotting occurs when plot markers lie on top of one another.
- Common, even in relatively small data sets
- Scatter plots can look like a blob and be completely uninterpretable
- Over-plotting is a significant problem in EDA and presentation graphics
Dealing with Over-plotting
What can we do about over-plotting?
- Marker transparency: so one can see markers underneath; useful in cases with minimal overlap of markers
- Marker size: smaller marker size reduces over-plotting within limits
- Adding jitter: adding a bit of random jitter to variables with limited number of values
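The three remedies listed above can be combined in a single scatter plot. The sketch below uses Matplotlib with synthetic data; the variable names, the jitter width of 0.2, and the marker settings are illustrative assumptions rather than recommended values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data: x takes only five distinct values, so markers stack up.
x = rng.integers(1, 6, size=n).astype(float)
y = x + rng.normal(scale=1.0, size=n)

# Jitter: small uniform noise, kept well under half the spacing between x values.
x_jittered = x + rng.uniform(-0.2, 0.2, size=n)

fig, (ax_raw, ax_fixed) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# Left panel: default markers, heavily over-plotted.
ax_raw.scatter(x, y)
ax_raw.set_title("Over-plotted")

# Right panel: jitter, small markers (s) and transparency (alpha).
ax_fixed.scatter(x_jittered, y, s=4, alpha=0.2)
ax_fixed.set_title("Jitter, small markers, transparency")
```

Calling `fig.savefig(...)` or `plt.show()` afterwards renders the comparison. Jitter changes the plotted positions, so it is best reserved for variables with few distinct values.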
Example of Over-plotting
Use Transparency, Marker Size, Downsampling
Other Methods to Display Large Data Sets
Alternatives to avoid over-plotting for truly large data sets
- Hex bin plots: the 2-dimensional equivalent of the histogram
- Frequency of values is tabulated into 2-dimensional hexagonal bins
- Displayed using a sequential color palette
- 2-d kernel density estimation plots: natural extension of the 1-dimensional KDE plot
- Good for moderately large data
- Heat map: values of one variable against another
- Categorical (count) or continuous variables
- Carefully choose color palette, sequential or divergent
- Mosaic plots: display multidimensional count (categorical) data
- Uses tile size and color to project multiple dimensions
- 2-d equivalent of a multi-variate bar chart
- Dimensionality reduction: we will discuss this later in the course
Hexbin Plot
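A hex bin plot along the lines described can be sketched with Matplotlib's `Axes.hexbin`. The data here are synthetic, and the `gridsize` and color map are assumptions chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic correlated data standing in for a large data set (an assumption).
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)

fig, ax = plt.subplots(figsize=(6, 5))

# Counts of points are tabulated into hexagonal bins and shown
# with a sequential color map.
hb = ax.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax, label="count")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

Because only the bin counts are drawn, the rendering cost depends on the number of bins rather than the number of points.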

Contour Plot
Other Methods to Display Large Data Sets
Sometimes a creative alternative is best
Often situation specific; many possibilities
Finding a good one can require significant creativity!
Example: choropleth for multi-dimensional geographic data
Example: time series of box plots
Time Series of Box Plots
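One way to build such a display is to group a time series by period and draw one box per period. The sketch below assumes synthetic daily values with a seasonal cycle and groups them by calendar month; the column names and the data are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series for one year with a seasonal cycle (an assumption).
rng = np.random.default_rng(2)
dates = pd.date_range("2020-01-01", "2020-12-31", freq="D")
day = dates.dayofyear.to_numpy()
values = 10 + 8 * np.sin(2 * np.pi * (day - 100) / 366) + rng.normal(scale=2, size=len(dates))

df = pd.DataFrame({"date": dates, "value": values})
df["month"] = df["date"].dt.month

# One array of values per month, in calendar order.
groups = [g["value"].to_numpy() for _, g in df.groupby("month")]

fig, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(groups)  # boxes are placed at positions 1..12, one per month
ax.set_xlabel("month")
ax.set_ylabel("value")
```

Each box summarizes a month with a handful of quantiles, so the display stays readable however many observations fall in each period.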

Displays for Complex Data
How can we understand the relationships in complex data with many variables?
Arrays of plots: subsets show relationships in a complex data set
Pairwise scatter plots: matrix of all pairwise combinations of variables
- Project additional dimensions with plot aesthetics
- Pairwise scatter plots can be created for subsets of large and complex data sets
Faceting: uses values of categorical or numeric variables to plot subsets
- Subsets are displayed on an array of plots
- Typically use axes on same scale to ensure correct perception of relationships
- Faceting goes by several other names: conditional plotting, method of small multiples, lattice plotting
Cognostics: sort large number of variables to find important relationships
Arrays of Plots
Display multiple plot views in an array or grid
- Create an array of plots which project multiple related views of data relationships
- Organize axes to give multi-dimensional view
- Example, scatterplot with kde plots on the margins
- Supported by Seaborn jointplot
Scatter Plot Matrix
Scatter plot matrix used to investigate relationships between a number of variables
- Key idea: Display a scatter plot of each variable versus all other variables
- Primarily EDA tool
- Conveys lots of information - requires study!
- Each pairwise relationship is displayed twice
- Two possible orientations
- Or two different plot types
- Can place histograms and KDE plots on diagonal
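A scatter plot matrix with histograms on the diagonal can be sketched with `pandas.plotting.scatter_matrix`. The three variables are synthetic and their names are hypothetical; Seaborn's `pairplot` is an alternative that also supports mapping a further variable to color.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Three synthetic, partly related variables (an assumption).
rng = np.random.default_rng(6)
n = 1_000
a = rng.normal(size=n)
b = 0.7 * a + rng.normal(scale=0.5, size=n)
c = rng.normal(size=n)
df = pd.DataFrame({"a": a, "b": b, "c": c})

# Off-diagonal cells: pairwise scatter plots (each pair appears twice,
# once in each orientation). Diagonal cells: histograms.
axes = scatter_matrix(df, diagonal="hist", alpha=0.3, s=5, figsize=(7, 7))
```

The returned array holds one Matplotlib axes per cell, so individual panels can be adjusted afterwards.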
Scatter Plot Matrix
Facet Plots
Facet plots revolutionized statistical graphics starting about 30 years ago
Facet Plots
Like many good ideas, facet plotting was invented several times
- Multiple contemporaneous inventors and names
- Tufte, 1990, introduced method of small multiples
- Cleveland, 1992, introduced trellis plotting
- Also known as conditioned plots
- Most packages use the term facet plot
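As a sketch of faceting in one such package, Seaborn's `FacetGrid` draws one panel per value of a conditioning variable. The weather-like data below are synthetic, and the column names `season`, `temp`, and `humidity` are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic weather-like data, one block of rows per season (an assumption).
rng = np.random.default_rng(4)
seasons = ["winter", "spring", "summer", "fall"]
mean_temp = {"winter": 2.0, "spring": 12.0, "summer": 24.0, "fall": 13.0}
frames = []
for season in seasons:
    temp = rng.normal(loc=mean_temp[season], scale=4.0, size=200)
    humidity = 80.0 - 0.8 * temp + rng.normal(scale=5.0, size=200)
    frames.append(pd.DataFrame({"season": season, "temp": temp, "humidity": humidity}))
weather = pd.concat(frames, ignore_index=True)

# One panel per season; shared axes keep the panels directly comparable.
g = sns.FacetGrid(weather, col="season", col_order=seasons, col_wrap=2,
                  sharex=True, sharey=True)
g.map_dataframe(sns.scatterplot, x="temp", y="humidity", s=10, alpha=0.5)
g.set_axis_labels("temperature", "humidity")
```

Sharing both axes across panels follows the usual practice of plotting every subset on the same scale.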
Facet Plot with Weather by Season

Cognostics
How can we visualize very high dimensional data?
Modern data sets have thousands to millions of variables
- Cannot possibly look at all of these
Idea: need to find the most important relationships
Use a cognostic to sort relationships
- A cognostic is a statistic used to sort data
- Sort the variables or relationships by the cognostic
- Plot relationships with most interesting cognostic
Idea originally proposed by Tukey, 1982, 1985
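The sort-then-plot workflow can be sketched with pandas. Here the cognostic is assumed to be the fitted growth rate of log price over time, computed per county; the county names, prices, and the choice of statistic are hypothetical and serve only to show the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic yearly prices for a few hypothetical counties, each with a
# known underlying growth rate plus a little noise.
rng = np.random.default_rng(5)
years = np.arange(2010, 2021)
true_growth = {"county_a": 0.02, "county_b": 0.08, "county_c": 0.05, "county_d": -0.01}

rows = []
for county, rate in true_growth.items():
    noise = rng.normal(1.0, 0.01, len(years))
    price = 200_000 * np.exp(rate * (years - years[0])) * noise
    rows.extend({"county": county, "year": y, "price": p} for y, p in zip(years, price))
prices = pd.DataFrame(rows)

def growth_rate(group):
    """Cognostic: slope of a straight-line fit of log(price) against year."""
    slope, _intercept = np.polyfit(group["year"], np.log(group["price"]), deg=1)
    return slope

# Compute the cognostic for every county, then sort so the most
# interesting cases come first.
cognostic = (
    prices.groupby("county")[["year", "price"]]
    .apply(growth_rate)
    .sort_values(ascending=False)
)
top = cognostic.head(2).index.tolist()
```

Only the counties in `top` would then be plotted, which keeps the number of views manageable when there are thousands of candidates.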
Cognostic: Counties With Fastest Rate of Housing Price Increase
## 0 DC
## 1 WY
## 2 ND
## 3 MT
## 4 OK
## 5 VT
## 6 MI
## 7 MN
## Name: entity_name, dtype: object
Summary
We have explored these key points
Proper use of plot aesthetics enables projection of multiple dimensions of complex data onto the 2-dimensional plot surface.
All plot aesthetics have limitations which must be understood to use them effectively
The effectiveness of a plot aesthetic varies with the type and the application
Visualization of modern data sets, growing in size and complexity
Visualization limited by 2-dimensional projection
Goal: Understand key relationships in large complex data sets
Difficulty: Large data volume
- Modern computational systems have massive capacity
- Example: Use map-reduce algorithms on cloud clusters
Difficulty: Large numbers of variables
- Huge number of variables with many potential relationships
- This is the hard part!